Synthesizing training samples for object recognition

ABSTRACT

An enhanced training sample set containing new synthesized training images that are artificially generated from an original training sample set is provided to satisfactorily increase the accuracy of an object recognition system. The original sample set is artificially augmented by introducing one or more variations to the original images with little to no human input. There are a large number of possible variations that can be introduced to the original images, such as varying the image's position, orientation, and/or appearance and varying an object's context, scale, and/or rotation. Because there are computational constraints on the amount of training samples that can be processed by object recognition systems, one or more variations that will lead to a satisfactory increase in the accuracy of the object recognition performance are identified and introduced to the original images.

BACKGROUND

Computer vision allows computing systems to understand an image or a sequence of images (e.g., video) by extracting information from the image. The ability of a computing system to accurately detect and localize objects in images has numerous applications, such as content-based searching, targeted advertisements, and medical diagnosis and treatment. It is a challenge, however, in object recognition methods and systems, to teach the computing system to detect and localize particular rigid or articulated objects in a given image.

Object recognition methods and systems operate based on a given set of training images that have been annotated with the location and type of object shown in an image. However, gathering and annotating training images is expensive, time consuming, and requires human input. For example, images of certain object types may be gathered using textual queries to existing image search engines that are filtered by human labelers that annotate the images. Such approaches are expensive or unreliable for object localization and segmentation because human interaction is required to provide accurate bounding boxes and segmentations of the object. Alternatively, algorithms requiring less training data may be used for object localization and segmentation. The algorithms identify particular invariant properties of an object to generalize all modes of variation of the object from existing training data. However, the accuracy of object recognition systems increases with the amount of training data. Accordingly, it is a challenge to develop large enough training sample sets to obtain satisfactory results.

SUMMARY

Implementations described and claimed herein address the foregoing problems by providing an enhanced training sample set containing new synthesized training images that are artificially generated from an original training sample set. The original sample set is artificially augmented by introducing one or more variations to the original images with little to no human input. There is a large number of possible variations that can be introduced to the original images to create a new and larger training set of images. Such variations include without limitation varying an image's position, orientation, and/or appearance (e.g., brightness) and/or varying the context, scale, and/or rotation of one or more objects in an image. Because there are computational constraints on the amount of training data that can be processed by object recognition systems, one or more variations that will lead to a satisfactory increase in the accuracy of the object recognition performance (e.g., the highest increase in accuracy) are identified and applied to the original images to create an enhanced training sample set.

In some implementations, articles of manufacture are provided as computer program products. One implementation of a computer program product provides a tangible computer program storage medium readable by a computing system and encoding a processor-executable program. Other implementations are also described and recited herein.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example object recognition system providing an enhanced training sample set for object processing.

FIG. 2 illustrates example components in an object recognition system for selecting one or more synthesizer models to output an enhanced training sample set.

FIG. 3 illustrates example components in an object recognition system for outputting object processing results based on an enhanced training sample set.

FIG. 4 illustrates an example model selection system for identifying one or more synthesizer models for each object class in an object recognition operation.

FIG. 5 illustrates an example object recognition system for introducing variations to an original training sample set based on one or more global transform models.

FIG. 6 illustrates example original images transformed based on global transform models.

FIG. 7 illustrates an example object recognition system for introducing variations to an original training sample set based on one or more object transform within the same image models.

FIG. 8 illustrates example original images transformed based on object transform within the same image models.

FIG. 9 illustrates an example object recognition system for introducing variations to an original training sample set based on one or more object transform within a different background models.

FIG. 10 illustrates example original images transformed based on object transform within a different background models.

FIG. 11 illustrates example operations for identifying one or more synthesized image generating models to generate a synthesized training sample set from an original training sample set for image processing.

FIG. 12 illustrates example operations for performing image processing on an unknown image based on training data.

FIG. 13 illustrates an example system that may be useful in implementing the described technology.

DETAILED DESCRIPTION

FIG. 1 illustrates an example object recognition system 100 providing an enhanced training sample set for object processing. The object recognition system 100 includes original training sample set 102, which is a set of training images and annotations for one or more object classes. The training images contained in the original training sample set 102 may be annotated by using a manual labeling method to identify objects in each training image. The training images may be annotated, for example, using image-level annotation, bounding box annotation, and/or segmentation annotation. However, other annotations including without limitation rotated rectangles or other shapes are contemplated.

Image-level annotation identifies an image as containing one or more particular objects. Bounding box annotation identifies an object as being enclosed within a particular rectangle in the image. Pixels belonging to the object may be extracted from the background using background-foreground extraction methods, which may be automatic, semi-automatic, or manual methods (e.g., by using a program such as GrabCut). Using segmentation annotation, each pixel is given an object label or non-object label. For example, the original training sample set 102 includes a training image with a palm tree and a training image with a bird. Using image-level annotation, the training image with the palm tree is labeled as an image containing a palm tree, and similarly the training image with the bird is labeled as an image containing a bird. Bounding box annotation identifies the bird or the palm tree as being enclosed within a particular rectangle in the respective images. Using segmentation annotation, each pixel in the training image containing the bird is identified with a bird or non-bird label, and each pixel in the training image containing the palm tree is identified with a palm tree or non-palm tree label. The original training sample set 102 may be used to construct one or more object recognition models for image processing. In an implementation, the one or more object recognition models may employ operations for learning multiple instances of an object, where the training images are annotated with some uncertainty in the unlabeled regions.

In an implementation, the original training sample set 102 is input into a synthesizer module 104 to generate synthesized training images from the original training sample set 102. The synthesizer module 104 applies one or more synthesizer models, including without limitation global transform models, object transform models, and object relocation models, to the original training sample set 102 to generate synthesized training images. For example, the images may be globally rotated or scaled and/or an object may be relocated to a different region, either within the same image or in a different image. Different selection criteria may also be employed to determine the appropriate region into which a specific object is relocated.

The synthesized training images, together with the original sample set 102, are collected into an enhanced training sample set 106, which teaches the object recognition models new variations of objects of interest. For example, the enhanced training sample set 106 includes variations of the palm tree and bird training images from the original training sample set 102: the palm tree has been rotated and relocated to another location within the same training image, and the training image with the bird has been globally transformed (e.g., flipped horizontally). The annotations within a training image are transformed or relocated in the same manner as the object. Accordingly, manual or human input and/or intervention (e.g., labeling) is unnecessary for the synthesized training images, so the amount of training data is increased without human input.
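As a minimal sketch of how an annotation may be transformed in the same manner as the image, the following Python example flips an image horizontally and updates its bounding box annotations to match. The nested-list image representation, the `Box` tuple layout (with exclusive maximum coordinates), and the function name `flip_horizontal` are assumptions made for illustration and are not taken from the described implementations.

```python
from typing import List, Tuple

# Hypothetical annotation layout: (x_min, y_min, x_max, y_max), max exclusive.
Box = Tuple[int, int, int, int]


def flip_horizontal(image: List[List[int]], boxes: List[Box]):
    """Flip an image left-to-right and move its bounding boxes with it."""
    width = len(image[0])
    # Reverse every pixel row to mirror the image.
    flipped_image = [row[::-1] for row in image]
    # Mirror each box about the vertical center line; the old right edge
    # becomes the new left edge, so no manual relabeling is needed.
    flipped_boxes = [
        (width - x_max, y_min, width - x_min, y_max)
        for (x_min, y_min, x_max, y_max) in boxes
    ]
    return flipped_image, flipped_boxes


if __name__ == "__main__":
    image = [[0, 1, 2, 3], [4, 5, 6, 7]]
    print(flip_horizontal(image, [(0, 0, 1, 2)]))
```

Because the annotation is derived from the transform itself, the synthesized image arrives already labeled.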

The object recognition models are trained by the enhanced training sample set 106 and are implemented by an image processor engine 108. Based on the training, the image processor engine 108 may perform object detection operations, object segmentation operations, object classification operations, and/or other object recognition operations for various object types. Object detection locates an object in a given image and may identify the object by placing a bounding box around the object. Object segmentation precisely identifies an object in a given image by selecting the pixels in the image that belong to the object. Object classification determines whether a given image contains an object belonging to a particular class and labels the object according to that class.

In an implementation, an unknown image 110 is input into the image processor engine 108, which performs object classification, detection, and/or segmentation operations on the unknown image 110 based on the training provided by the enhanced training sample set 106. The image processor engine 108 outputs object processing results 112. The object processing results 112 may include a detected, segmented, and/or classified object from the unknown image 110, which may be presented in a graphical user interface, data readout, data stream, etc.

FIG. 2 illustrates example components in an object recognition system 200 for selecting one or more synthesizer models to output an enhanced training sample set. The object recognition system 200 includes an original training sample set 202, which is a set of training images and annotations for one or more object classes. The original training sample set 202 is input into a synthesizer module 204, which employs one or more synthesizer models to generate synthesized training images. The synthesizer models may include without limitation global transform models, object transform models, and object relocation models. For example, an object may be relocated to another image region, either within the original training image or to another background.

Based on the synthesizer models, the synthesizer module 204 collects training sample sets containing new synthesized training images that are generated from the original training sample set 202. The new training sample sets may be used to construct one or more object recognition models. However, the object recognition models based on the training sample sets may have varying degrees of performance across different object classes. The variance in performance across different object classes results from different degrees of intra-class variation, the necessity of context for certain object classes within a training image, and the comprehensiveness of the available training samples for each object class. Accordingly, a model selector module 206 analyzes the training sample sets output from the synthesizer module 204 based on the synthesizer models to arbitrate between the synthesizer models on a per-class basis. The model selector module 206 validates the training sample sets based on the synthesizer models by determining which of the synthesizer models improves the performance of the object recognition system 200 for each object class. In one implementation, the training sample sets are tested against a validation set to determine the accuracy of the object recognition performance of the object recognition system 200 based on each training sample set. The model selector module 206 selects the synthesizer model(s) that contain one or more variations to the original training sample set 202 that lead to a satisfactory increase in the accuracy of the object recognition performance of the object recognition system 200 for each object class. For some object classes, the original training sample set 202, un-augmented by a synthesizer model, may lead to an object recognition model with a satisfactorily increased accuracy (e.g., the highest accuracy).

It should be understood that a satisfactory increase in accuracy may be defined by an accuracy condition. For example, the object recognition system 200 may set an accuracy condition, wherein the highest accuracy satisfies the condition. Alternatively, an accuracy condition may be set to a threshold (e.g., 95%), such that any set of models that results in an accuracy above 95% may be selected. Other accuracy conditions may also be employed.
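The two example accuracy conditions may be sketched as follows. This is an illustrative Python example only; the function name `select_models`, the dictionary of accuracies keyed by model name, and the sample values are hypothetical.

```python
from typing import Dict, List, Optional


def select_models(accuracy_by_model: Dict[str, float],
                  threshold: Optional[float] = None) -> List[str]:
    """Return the model names that satisfy the accuracy condition.

    With no threshold, the condition is "highest accuracy" (ties are kept).
    With a threshold, every model strictly above it satisfies the condition.
    """
    if not accuracy_by_model:
        return []
    if threshold is None:
        best = max(accuracy_by_model.values())
        return [name for name, acc in accuracy_by_model.items() if acc == best]
    return [name for name, acc in accuracy_by_model.items() if acc > threshold]


if __name__ == "__main__":
    # Hypothetical validation accuracies for one object class.
    scores = {"null": 0.91, "global": 0.96, "same_image": 0.97}
    print(select_models(scores))        # highest-accuracy condition
    print(select_models(scores, 0.95))  # threshold condition
```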

The model selector module 206 generates feedback to the synthesizer module 204 by identifying which synthesizer model(s), if any, should be applied to the original training sample set 202 for each object class to construct object recognition models with satisfactorily increased performance accuracy. The synthesizer module 204 applies the selected synthesizer model(s) to the object classes in the original training sample set 202 to output an enhanced training sample set 208. The enhanced training sample set 208 may include the original training sample set 202 in addition to the synthesized training images generated for each class. The enhanced training sample set 208 includes selected training sample sets for each object class that may be used to construct object recognition model(s) that have a satisfactorily increased accuracy in performing object recognition operations for each class. By selecting the variations to apply to the original training sample set 202 for each object class, the accuracy of the object recognition system 200 is significantly increased without human input.

FIG. 3 illustrates example components in an object recognition system 300 for outputting object processing results based on an enhanced training sample set. The object recognition system 300 includes an enhanced training sample set 302, which includes training sample sets for each object class. The enhanced training sample set 302 is input into an image processing engine 306 to train the image processing engine 306 to perform various object recognition operations. Based on the training from the enhanced training sample set 302, the image processing engine 306 constructs one or more object recognition models, which have improved performance accuracy for each object class.

In one implementation, an unknown image 304 is input into the image processing engine 306. The unknown image 304 may have one or more objects belonging to various object classes. The image processing engine 306 implements the object recognition models constructed from the enhanced training sample set 302 to perform object recognition operations on the unknown image 304.

The image processing engine 306 includes a detection module 308, a segmentation module 310, and a classification module 312, which execute the object recognition models to perform the object recognition operations, such as detection, segmentation, and classification, for various object types. The image processing engine 306 outputs object processing results 314, which include one or more of a detected object 316, a segmented object 318, and a classified object 320. The object processing results 314 may be presented in a graphical user interface, data readout, data stream, etc.

For example, the detection module 308 processes the unknown image 304 to locate an object in the unknown image 304. The detection module 308 executes the object recognition models for various object classes to determine if the unknown image 304 contains any known objects. The detection module 308 outputs the detected object 316 within the unknown image 304. In one implementation, the detected object 316 is identified by a bounding box placed around the detected object 316.

The segmentation module 310 processes the unknown image 304 to precisely identify an object in the unknown image 304 by selecting which pixels in the unknown image 304 belong to the object. The segmentation module 310 executes the object recognition models for various object classes to determine the precise boundaries of a particular object in the unknown image 304. The segmentation module 310 outputs the segmented object 318 within the unknown image 304. In one implementation, the segmented object 318 is outlined within the unknown image 304. In another implementation, the segmentation module 310 uses a segmentation mask to cut the segmented object 318 from the unknown image 304.

The classification module 312 processes the unknown image 304 to determine whether the unknown image 304 contains an object belonging to a particular object class. The classification module 312 executes the object recognition models for various object classes to determine which object classes the unknown image 304 contains objects belonging to. The classification module 312 outputs the classified object 320 from the unknown image 304. In one implementation, the unknown image 304 is identified as containing the classified object 320 belonging to a particular object class.

FIG. 4 illustrates an example model selection system 400 for identifying one or more synthesizer models for each object class in an object recognition operation. The model selection system 400 includes original training samples 402, containing training images with objects from various object classes. The training images contained in the original training samples 402 are annotated to identify objects in each training image.

One or more synthesizer models 404 are applied to the original training samples 402 to obtain a collection of synthesized training sample sets for each object class. When one or more of the synthesizer models 404 is applied to the original training samples 402, the annotations within each training image of the original training samples 402 are transformed or otherwise augmented according to the synthesizer model applied to the image. The synthesizer models 404 include a null model 406, a global transform model 408, an object transform within a same image model 410, and an object transform within a different background model 412. The object transform within a same image model 410 and the object transform within the different background model 412 are examples of object-based transform models. Object-based transform models constrain transforms to pixels in or associated with an object, in contrast to global transform models, which perform transforms on an entire image. However, other synthesizer models based on one or more variations to a training image or other object-based transforms may be applied to the original training samples 402. For example, a synthesizer model that adds new objects to a training image from a preset database may be applied to the original training samples 402, in combination with one or more of the synthesizer models 406, 408, 410, and 412. The new objects may be added to vary the image context of an object. Additionally, each training image in the synthesized training sample sets may be normalized to ensure that realistic training images are generated.

For example, the original training samples 402 are input into the null model 406. The null model 406 does not augment the original training samples 402 or generate additional synthesized training images. Accordingly, the null model 406 outputs the original training samples 402.

When the original training samples 402 are input into the global transform model 408, each training image is augmented as a whole. However, each training image is transformed independently from other training images. The global transform model 408 applies one or both of photometric and geometric transforms to a training image. For example, a training image may be globally rotated or scaled.

The object transform within the same image model 410 relocates or otherwise transforms an object within the training image from which it originated. For example, the original training samples 402 are input into the object transform within the same image model 410. Each training image is analyzed to sequentially locate all objects within the training image and to relocate each object to a different region within the same training image. Additionally, each object may be scaled according to an articulated scaling schedule.

The object transform within a different background model 412 relocates an object to a different background image. In one implementation, the training image into which the object is to be relocated may be selected based on the average grayscale intensity of the image background as compared to the grayscale intensity of the training image from which the object originated. In another implementation, the training image into which the object is to be relocated may be selected based on the presence of co-occurring objects in the background.
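One possible reading of the grayscale-based selection is to choose the candidate background whose average intensity is closest to that of the originating image. The Python sketch below illustrates that reading only; the closest-match criterion, the use of whole-image averages rather than background-only pixels, and the function names `mean_intensity` and `closest_background` are assumptions and are not specified above.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0-255 (assumed representation)


def mean_intensity(image: Image) -> float:
    """Average grayscale intensity over all pixels of an image."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def closest_background(source: Image, candidates: List[Image]) -> int:
    """Index of the candidate whose mean intensity is nearest the source's.

    Assumption: "nearest mean" is used as the comparison; other
    comparisons are equally consistent with the description.
    """
    target = mean_intensity(source)
    return min(
        range(len(candidates)),
        key=lambda i: abs(mean_intensity(candidates[i]) - target),
    )


if __name__ == "__main__":
    source = [[100, 120], [110, 130]]
    candidates = [
        [[10, 20], [30, 40]],
        [[110, 115], [120, 125]],
        [[200, 210], [220, 230]],
    ]
    print(closest_background(source, candidates))
```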

In one implementation, for each object class, the original training samples 402 containing training images with objects from a particular class are input into each of the synthesizer models 406, 408, 410, and 412 separately. In another implementation, multiple synthesizer models 404 are applied concurrently to the training images in the original training samples 402 containing objects from the particular class. The synthesized training sample sets resulting from application of each synthesizer model 406, 408, 410, and 412 separately, or from application of a combination of the synthesizer models 406, 408, 410, and 412, are each separately input into an object recognition engine 414. The object recognition engine 414 performs object recognition operations on the received synthesized training sample set, including without limitation object detection, object segmentation, and object classification. The object recognition engine 414 outputs object processing results into a validation engine 416. The object processing results may include detected, segmented, or classified objects from the synthesized training sample set for each object type in a validation sample set.

The validation engine 416 analyzes the object processing results according to a performance measurement score (e.g., by assigning an average precision (AP) score to each synthesizer model for each object class). In one implementation, the validation engine 416 presents the object recognition results to a human annotator in a graphical user interface, data readout, data stream, etc. for the annotator to assign a performance measurement score (e.g., an AP score) to a synthesizer model for each object type. The performance measurement scores and/or validation sample set are input into a model selector engine 418, which analyzes the performance measurement scores and/or validation sample set to select the synthesizer model for each object class. The model selector engine 418 selects the synthesizer model with a satisfactorily increased object recognition performance accuracy for each object class. The model selector engine 418 outputs the selected synthesizer model 420 for each object class. In one implementation, the synthesizer model 420 is applied to the original training samples 402 to obtain an enhanced training sample set from which a validated object recognition model for each object class can be constructed. In another implementation, the synthesizer model 420 is applied to a validation training sample set to obtain an enhanced training sample set.
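The per-class arbitration performed on the performance measurement scores may be sketched as below. The example assumes the scores are already available as a nested dictionary of AP values and that the highest score is the selection criterion; the function name `select_per_class`, the model labels, and the numeric values are hypothetical.

```python
from typing import Dict


def select_per_class(ap_scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Pick, for each object class, the synthesizer model with the top AP.

    ap_scores maps object class -> {synthesizer model name -> AP score}.
    """
    return {
        object_class: max(scores, key=scores.get)
        for object_class, scores in ap_scores.items()
    }


if __name__ == "__main__":
    # Hypothetical AP scores from a validation run.
    ap_scores = {
        "bird": {"null": 0.40, "global": 0.52, "same_image": 0.47},
        "palm_tree": {"null": 0.61, "global": 0.58, "same_image": 0.55},
    }
    print(select_per_class(ap_scores))
```

In this hypothetical run the null model is selected for one class, which corresponds to the case in which the un-augmented original samples already yield the best-performing model.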

FIG. 5 illustrates an example object recognition system 500 for introducing variations to an original training sample set based on one or more global transform models. The object recognition system 500 includes an original training sample set 502, containing training images with objects from various object classes. The training images contained in the original training sample set 502 are annotated to identify objects in each training image. One or more global transform models 504 transform or otherwise augment a given training image as a whole, including any objects and the background. The annotations are transformed with their corresponding objects and/or background.

The global transform models 504 are applied to the original training sample set 502 to generate a synthesized training sample set 510 for each object type. The global transform models 504 generate synthesized training images that introduce, for example, poses, scales, and appearances of objects of interest that are different from the training images in the original training sample set 502. In one implementation, the global transform models 504 include a photometric transform model 506 and a geometric transform model 508. The synthesized training sample set 510 includes training images generated by one or both of the photometric transform model 506 and the geometric transform model 508.

The photometric transform model 506 globally augments the appearance of a training image from the original training sample set 502 by varying the photometric characteristics, including without limitation the brightness, luminosity, color balance, and contrast, of the training image. The photometric transform model 506 applies a photometric transform independently of other training images in the original training sample set 502. The photometric transform may be linear or non-linear.
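A linear photometric transform may be illustrated with a per-pixel gain and bias, where the gain varies contrast and the bias varies brightness. The following Python sketch assumes 8-bit grayscale pixels stored as nested lists; the function name `adjust_brightness_contrast` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import List


def adjust_brightness_contrast(image: List[List[int]],
                               gain: float = 1.0,
                               bias: float = 0.0) -> List[List[int]]:
    """Apply out = gain * pixel + bias to every pixel, clipped to 0-255.

    gain > 1 raises contrast, gain < 1 lowers it; a positive bias
    brightens the image and a negative bias darkens it.
    """
    return [
        [min(255, max(0, round(gain * pixel + bias))) for pixel in row]
        for row in image
    ]


if __name__ == "__main__":
    image = [[0, 100], [200, 255]]
    print(adjust_brightness_contrast(image, bias=-50))  # darker variation
    print(adjust_brightness_contrast(image, gain=2.0))  # higher contrast
```

Since pixel positions are unchanged, the annotations of the original training image carry over to the photometrically transformed image as they are.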

The geometric transform model 508 globally transforms a given training image from the original training sample set 502 by applying a linear or non-linear geometric transformation including without limitation rotating, flipping, or scaling the training image. In one implementation, in-plane scaling, flipping, and rotation are applied independently. In another implementation, the training image is both rotated and scaled in-plane. For example, each training image in the original training sample set 502 may be rotated four times to generate four different variations of each training image. The range of the rotation angle may be dynamically set based on each particular training image, which prevents objects from extending beyond the image borders. To determine the range of rotation angles for each training image, the aspect ratio of the largest object in the training image is computed. The aspect ratio is the ratio between the minimum object bounding box dimension and the maximum dimension. An additional example includes scaling each training image in the original training sample set 502 using four constant scales to generate four variations of the same training image. The scales may be selected experimentally to determine the necessary scaling to output a synthesized training sample set 510 that leads to higher accuracy.
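The aspect ratio computation and the generation of four rotation angles may be sketched as follows. The description above does not state how the aspect ratio maps to a range of rotation angles, so this Python example leaves the maximum angle as a caller-supplied value; the even spacing of angles, the box layout, and the function names `largest_object_aspect_ratio` and `rotation_angles` are assumptions for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (x_min, y_min, x_max, y_max)


def largest_object_aspect_ratio(boxes: List[Box]) -> float:
    """Min/max bounding box dimension of the largest (by area) object."""
    x_min, y_min, x_max, y_max = max(
        boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1])
    )
    width, height = x_max - x_min, y_max - y_min
    return min(width, height) / max(width, height)


def rotation_angles(max_angle: float, count: int = 4) -> List[float]:
    """Evenly spaced angles in [-max_angle, +max_angle]; count must be >= 2.

    Assumption: even spacing; max_angle is chosen by the caller (for
    example, from the aspect ratio) since no mapping is given in the text.
    """
    step = 2 * max_angle / (count - 1)
    return [-max_angle + i * step for i in range(count)]


if __name__ == "__main__":
    boxes = [(0, 0, 10, 20), (5, 5, 45, 25)]
    print(largest_object_aspect_ratio(boxes))
    print(rotation_angles(30.0))
```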

The synthesized training sample set 510 may be combined with the original training sample set 502 to form an enhanced training sample set 512, which contains training images for object types that had a higher object recognition performance accuracy based on the application of one or more of the global transform models 504.

FIG. 6 illustrates example original images transformed based on global transform models. A table 600 shows three examples A, B, and C of the application of global transform models to an original image containing a palm tree.

Example A shows the application of a geometric global transform model. The original image is flipped horizontally to generate a transformed image with a palm tree facing the opposite direction as in the original image. The transformed image provides a new pose for the palm tree.

Example B shows the original image containing the palm tree transformed by a geometric global transform model. The original image is rotated clockwise to generate a transformed image with an angled palm tree. The transformed image provides a new pose for the palm tree.

Example C shows the application of a photometric global transform model. The brightness of the original image is adjusted to generate a transformed image with a palm tree that is darker than in the original image. The transformed image provides a new appearance for the palm tree.

FIG. 7 illustrates an example object recognition system 700 for introducing variations to an original training sample set based on one or more object transform within the same image models. The object recognition system 700 includes an original training sample set 702, containing training images with objects from various object classes. The training images contained in the original training sample set 702 are annotated to identify objects in each training image. If an object within a training image is transformed or otherwise augmented, the annotations corresponding to that object are also transformed.

One or more object transform within the same image models 704 are applied to the original training sample set 702 to generate a synthesized training sample set 714 for each object type. The object transform within the same image models 704 generate synthesized training images that introduce poses, scales, and contexts of objects of interest that are different from the training images in the original training sample set 702. For each training image from the original training sample set 702, all objects within the training image are processed sequentially to relocate or scale each object within the same training image. The object transform within the same image models 704 include an object rotation model 706, an object transform to a different region model 708, a scaling model 710, and a region filling model 712, which each operate to increase generalization of an object type while maintaining some image context for each object. For each of the object transform within the same image models 704, a segmentation mask may be applied to a training image to precisely identify and segment an object within the training image. However, bounding boxes and a background-foreground extraction method (e.g., GrabCut) to separate the foreground from the background within the bounding box may be employed to augment the training images with the object transform within the same image models 704.

The object rotation model 706 identifies all objects within a training image in the original training sample set 702. One by one, each object is sequentially rotated within the same training image from which the object originated. The rotations applied may include, without limitation, flipping an object horizontally, flipping an object vertically, rotating an object according to a range of rotation angles that are dynamically set based on the particular training image, or some other combination.

The object transform to a different region model 708 displaces one or more objects from a training image in the original training sample set 702 and relocates each of these objects to another region within the same training image. Placement of each object may be carefully selected to ensure that objects are not overlapping after relocation. To avoid overlapping objects, the object transform to a different region model 708 may relocate an object to a background region or a region having a large background to foreground ratio. Further, a region within the training image to relocate an object to may be chosen based on image context to ensure that variations are introduced to the original training sample set 702. For example, a car is generally shown on a road or similar region, so relocating the car to a region in the training image showing a body of water where a car would not generally be located is, for many applications, not an appropriate variation. Additionally, the object transform to a different region model 708 may be applied in combination with the scaling model 710 to ensure that an object is relocated to a region that will not result in overlapping objects.

The scaling model 710 uses constant or dynamic scales to introduce variations to all objects from a training image in the original training sample set 702. The specific scale for an object is determined based on the region in the training image into which the object is to be relocated. In one implementation, a sliding window is placed over a region in a training image to identify a target bounding box into which an object from the same training image, having an original bounding box, is to be relocated. The target bounding box is scaled according to a scaling schedule relative to the original object bounding box. For example, 0.5, 0.75, and 1.0 may be utilized as scale down factors of the original object bounding box size. Additionally, scale up factors may also be utilized. The object is scaled according to the selected scaling factor and may be relocated to the target bounding box in a different image region.
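The sliding-window search and scaling schedule can be sketched as follows. This is an illustrative sketch under stated assumptions, not the described implementation: boxes are `(x, y, width, height)` tuples, the window stride is a free parameter, and a candidate target box is accepted only if it overlaps none of the boxes in `occupied`. The names `overlaps`, `candidate_targets`, `occupied`, and `stride` are hypothetical.

```python
def overlaps(a, b):
    """True if two (x, y, width, height) boxes share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def candidate_targets(image_size, object_box, occupied,
                      scales=(0.5, 0.75, 1.0), stride=8):
    """List (scale, target_box) pairs that overlap no box in `occupied`.

    `scales` follows the example schedule in the text; `stride` and the
    acceptance rule (no overlap at all) are assumptions of this sketch.
    """
    img_w, img_h = image_size
    _, _, w, h = object_box
    results = []
    for s in scales:
        # Scale the original bounding box size by the schedule entry.
        tw, th = max(1, round(w * s)), max(1, round(h * s))
        # Slide a window of the scaled size over the image.
        for y in range(0, img_h - th + 1, stride):
            for x in range(0, img_w - tw + 1, stride):
                target = (x, y, tw, th)
                if not any(overlaps(target, box) for box in occupied):
                    results.append((s, target))
    return results


object_box = (0, 0, 8, 8)
# Treat the object's current location as occupied so every candidate is a real move.
targets = candidate_targets((16, 16), object_box, [object_box],
                            scales=(0.5, 1.0), stride=8)
```

A fuller version might rank the candidates by background-to-foreground ratio or by image context, as the text suggests, rather than returning all of them.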

In one implementation, the region filling model 712 increases the amount of available background within a training image in the original training sample set 702. When an object is rotated, scaled, or relocated to a different region within a training image, a hole in the training image is created. The region filling model 712 fills the pixels corresponding to the hole in the training image to match the background of the training image. Accordingly, another object may be relocated to the new background image region. In another implementation, various methods of compositing an object against a new background image region may be used, including without limitation simple masking, feathering, alpha-mask blending (i.e., transparency), Poisson blending, etc.
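Of the compositing methods listed, alpha-mask blending is the simplest to show concretely. The sketch below assumes grayscale patches of equal size stored as lists of rows and a per-pixel alpha mask in the range 0 to 1; the function name `alpha_blend` is hypothetical, and the sketch does not attempt the region filling step itself.

```python
def alpha_blend(foreground, background, alpha):
    """Composite `foreground` over `background` with a per-pixel alpha mask.

    All three arguments are equally sized lists of rows. An alpha of 1.0
    keeps the object pixel, 0.0 keeps the background pixel, and values in
    between mix the two (e.g., along a feathered object boundary).
    """
    return [
        [a * f + (1.0 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(foreground, background, alpha)
    ]


object_patch = [[100, 100]]
background_patch = [[0, 200]]
mask = [[1.0, 0.5]]
blended = alpha_blend(object_patch, background_patch, mask)
```

Simple masking corresponds to a mask containing only 0.0 and 1.0, and feathering corresponds to a mask that ramps smoothly toward 0.0 at the object boundary.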

The synthesized training sample set 714 includes training images generated by one or more of the object transform within the same image models 706, 708, 710, and 712. For example, an object may be rotated, relocated, and/or scaled, and the region from which the object was removed may be filled. The synthesized training sample set 714 may be combined with the original training sample set 702 to form an enhanced training sample set 716, which contains training images for object types that had a higher object recognition performance accuracy based on the application of one or more of the object transform within the same image models 704.

FIG. 8 illustrates example original images transformed based on object transform within the same image models. A table 800 shows three examples A, B, and C of the application of object transform within the same image models to an original image containing a palm tree and a bird. Each of the examples A, B, and C increases the generalization for the bird object type while allowing the bird object type to maintain image context (e.g., birds are generally located around trees).

Example A shows the application of an object rotation model. The bird in the original image is flipped horizontally to generate a transformed image with the bird facing the direction opposite to that in the original image. The transformed image provides a new pose for the bird, while the palm tree and the remainder of the background of the original image are un-augmented.

Example B shows the original image containing the palm tree and the bird transformed by an object transform to a different region model. The bird in the original image is displaced and relocated to a region in the original image closer to the palm tree. The transformed image provides a different background region for the bird.

Example C shows the application of an object transform to a different region model and a scaling model. The bird in the original image is displaced and relocated to a region in the original image closer to the palm tree, and the bird is scaled to maximize the background to foreground ratio. The transformed image provides a different background region and pose for the bird.

FIG. 9 illustrates an example object recognition system 900 for introducing variations to an original training sample set based on one or more object transform within a different background models. The object recognition system 900 includes an original training sample set 902, containing training images with objects from various object classes. The training images contained in the original training sample set 902 are annotated to identify objects in each training image. If an object within a training image is transformed or otherwise augmented, the annotations corresponding to that object are also transformed.

One or more object transform within a different background models 904 are applied to the original training sample set 902 to generate a synthesized training sample set 910 for each object type. The object transform within a different background models 904 generate synthesized training images that introduce variations in the background of objects of interest that are different from the training images in the original training sample set 902. In one implementation, the object transform within a different background models 904 relocate a source object to a target image from the original training sample set 902. In another implementation, the object transform within a different background models 904 relocate a source object to a background selected from a predetermined set of images separate from the original training sample set 902. The object transform within a different background models 904 include a background value model 906 and a co-occurring objects model 908, which each operate to reduce the dependence on context for performing object recognition operations for an object by providing different scenery for an object type. For each of the object transform within a different background models 904, a segmentation mask may be applied to a training image to precisely identify and segment an object within the training image. Alternatively, bounding boxes and a background-foreground extraction method (e.g., GrabCut) that separates the foreground from the background within the bounding box may be employed to augment the training images with the object transform within a different background models 904.

The background value model 906 characterizes the background of a training image from the original training sample set 902 by its average grayscale or color intensity value Ī_(B) (average background value). A target training image into which an object is to be relocated from a source training image may be selected based on the relation between the average background value of the target training image and the source training image. In one implementation, a target training image is selected by specifying a random background value and locating training images with an average background value relatively close to the random value. In another implementation, a target training image is selected by locating training images with a Ī_(B) value relatively far from that of the source training image. The background value model 906 clusters the training images from the original training sample set 902 into k categories according to Ī_(B) value using a uniform quantization of the range of average background values. For each object from a source training image to be relocated, an image category index h_(S) is computed. A target training image is selected from training images collected by applying the following rule, where h_(D) is a destination category index:

$h_{D} = {h_{S} + \frac{k}{2}}$

Accordingly, the background value model 906 bases object relocation on the farthest background appearance with varying background categories.
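The category rule can be illustrated with a short sketch. The text specifies uniform quantization into k categories and the rule h_D = h_S + k/2; the intensity range of 0 to 255, the integer division, and the wrap-around modulo k (which keeps the destination index within the k categories) are assumptions of this sketch, and the function names are hypothetical.

```python
def category_index(avg_background, k, lo=0.0, hi=255.0):
    """Uniformly quantize an average background value into one of k categories.

    The [lo, hi] range of 0..255 is an assumption (8-bit grayscale).
    """
    if not lo <= avg_background <= hi:
        raise ValueError("average background value outside the assumed range")
    index = int((avg_background - lo) / (hi - lo) * k)
    return min(index, k - 1)  # the top of the range falls in the last category


def destination_category(h_s, k):
    """Apply h_D = h_S + k/2.

    Integer division and wrapping modulo k are assumptions of this sketch;
    the text gives only the unwrapped rule.
    """
    return (h_s + k // 2) % k


k = 8
h_s = category_index(40.0, k)        # dark background -> low category index
h_d = destination_category(h_s, k)   # category half the range away
```

With the wrap-around, the destination category is always half the number of categories away from the source category, which matches the stated goal of selecting a background that is far in appearance from the source.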

The co-occurring objects model 908 relocates a source object to a background region in a training image in the original training sample set 902 that contains one or more objects that are highly co-occurring with the source object. The co-occurring objects model 908 provides a new background while maintaining image context of the source object by relocating it to a training image with co-occurring objects. The target training image into which the source object is to be relocated is selected based on object co-occurrence statistics. In one implementation, the co-occurrence statistics of all objects in the original training sample set 902 are computed. For a specific object O in a source image, the top co-occurring object classes R are identified. The highest-ranked co-occurring class (i.e., R=1) is identified and denoted C_(Z). All training images belonging to the class C_(Z) are identified and ranked according to predetermined rules. For example, if the scale of the source object O with respect to its class, denoted by S_(Rel) _(O) , is similar to the scale of the highest co-occurring object in a target training image, denoted by S_(Rel) _(K) , then the object O may be relocated into the target training image relatively easily. The relative scale of the source object O is defined as follows:

$S_{{Rel}_{O}} = \frac{S_{O} - \bar{S}_{C(O)}}{\max \left( S_{C(O)} \right) - \min \left( S_{C(O)} \right)}$

S_(O) is measured by the area of a bounding box of the object O, S̄_(C(O)) is the average bounding box area of the source object O's class, max(S_(C(O))) is the maximum bounding box area, and min(S_(C(O))) is the minimum bounding box area of all object instances in the class of the source object O. The training images belonging to the class C_(Z) are categorized based on their similarity with the source object O in terms of relative scale using the following rule:

${Sim} = \frac{1}{1 + \left( {S_{{Rel}_{O}} - S_{{Rel}_{K}}} \right)^{2}}$

Accordingly, the co-occurring objects model 908 relocates the source object O to a training image that is the most similar in context to the source training image based on co-occurring objects.
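The selection steps above (co-occurrence statistics, relative scale, and similarity ranking) can be sketched as follows. This is a minimal illustration under stated assumptions: each training image is represented by a list of object class labels, co-occurrence is counted once per image, and a class whose instances all have the same area is given a relative scale of zero. The function names and data layout are hypothetical.

```python
from collections import Counter


def top_cooccurring_class(image_labels, source_class):
    """Return the class that most often shares an image with `source_class`.

    `image_labels` is a list of per-image label lists. Counting each class
    once per image is an assumption of this sketch.
    """
    counts = Counter()
    for labels in image_labels:
        classes = set(labels)
        if source_class in classes:
            counts.update(classes - {source_class})
    return counts.most_common(1)[0][0] if counts else None


def relative_scale(area, class_areas):
    """S_Rel: (area - class mean) / (class max - class min)."""
    spread = max(class_areas) - min(class_areas)
    if spread == 0:
        return 0.0  # assumption: a class with no size variation has zero relative scale
    mean = sum(class_areas) / len(class_areas)
    return (area - mean) / spread


def scale_similarity(s_rel_o, s_rel_k):
    """Sim = 1 / (1 + (S_Rel_O - S_Rel_K)^2); equals 1.0 for identical relative scales."""
    return 1.0 / (1.0 + (s_rel_o - s_rel_k) ** 2)


def rank_targets(s_rel_o, candidates):
    """Sort (image_id, s_rel_k) pairs from most to least similar in relative scale."""
    return sorted(candidates,
                  key=lambda c: scale_similarity(s_rel_o, c[1]),
                  reverse=True)


labels = [["bird", "tree"], ["bird", "tree", "car"], ["car", "road"]]
c_z = top_cooccurring_class(labels, "bird")
s_rel_o = relative_scale(300.0, [100.0, 200.0, 300.0])
ranking = rank_targets(s_rel_o, [("img_a", -0.5), ("img_b", 0.5)])
```

In this toy data, the tree class co-occurs with the bird class most often, and the candidate whose co-occurring object has the same relative scale as the source object is ranked first.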

The synthesized training sample set 910 includes training images generated by one or more of the object transform within a different background models 906 and 908. The synthesized training sample set 910 may be combined with the original training sample set 902 to form an enhanced training sample set 912, which contains training images for object types that had a higher object recognition performance accuracy based on the application of one or more of the object transform within a different background models 904.

FIG. 10 illustrates example original images transformed based on object transform within a different background models. A table 1000 shows two examples A and B of the application of object transform within a different background models to an original image.

Example A shows the application of a co-occurring objects model. The original image contains a bird flying by a tree. The co-occurring objects model determines that birds frequently co-occur with trees, and collects images that contain trees. The co-occurring objects model selects a target image from the collected images that is the most similar in context to the original image. The transformed image shows the bird flying near a palm tree. The transformed image provides new scenery for the bird while maintaining the context.

Example B shows an original image containing a cat and a plant transformed by a background value model. A target training image into which the cat is to be relocated is selected based on the relation between the average background value of the target training image, which contains a cactus, and that of the source training image, which contains the plant. The transformed image provides new scenery for the cat.

FIG. 11 illustrates example operations 1100 for identifying one or more synthesized image generating models to generate a synthesized training sample set from an original training sample set for image processing. An application operation 1102 applies one or more transform models to original training samples, which are a set of training images and annotations for one or more object classes. The training images contained in the original training samples are annotated by using a manual labeling method (e.g., bounding boxes) to specify which pixels in a training image belong to an instance of a specific object type (e.g., birds and trees). All pixels not belonging to a specific object in a training image are annotated as background.

In an implementation, synthesized training images are generated from the original training samples. One or more transform models, including without limitation global transform models and object-based transform models, are applied to the original training samples to generate synthesized training images. For example, the images may be globally rotated or scaled and/or an object may be relocated to a different region, either within the same image or in a different image. Different selection criteria may also be employed to determine an appropriate region into which a specific object is to be relocated.

A selection operation 1104 selects one or more transform models for each object class based on validated performance improvement. Based on the transform models, new training sample sets containing new synthesized training images that are generated from the original training samples are collected. The new training sample sets may be used to construct one or more object recognition models. However, the object recognition models based on the training sample sets may have varying degrees of performance across different object classes. Accordingly, the training sample sets based on the transform models are analyzed to arbitrate between the transform models on a per class basis. The training sample sets based on the transform models are validated by determining which of the transform models improves the performance of the object recognition operations for each object class. The transform model(s) that contain one or more variations to the original training samples which will lead to a satisfactory increase in the accuracy of the object recognition operations for each object class are selected. For some object classes, the original training samples, un-augmented by a transform model, may lead to an object recognition model with a satisfactory accuracy. Accordingly, the selection operation 1104 identifies which transform model(s), if any, should be applied to the original training samples for each object class to construct object recognition models with a satisfactorily increased performance accuracy.
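The per-class arbitration can be sketched as a comparison of validation accuracies. The sketch below is illustrative only: it assumes accuracy values have already been measured on a validation set for the un-augmented baseline and for each transform model, it selects a single best model per class (the text allows one or more), and the `min_gain` threshold standing in for a "satisfactory" increase is hypothetical, as are the function and dictionary names.

```python
def select_transform_models(baseline, validated, min_gain=0.0):
    """Choose, per object class, the transform model with the best validated accuracy.

    `baseline` maps class -> accuracy using only the original training samples.
    `validated` maps model name -> {class -> accuracy with that model's samples}.
    A class maps to None when no model beats the baseline by more than `min_gain`,
    meaning the original samples are left un-augmented for that class.
    """
    selection = {}
    for cls, base_acc in baseline.items():
        best_model, best_acc = None, base_acc
        for model, scores in validated.items():
            acc = scores.get(cls)
            if acc is not None and acc - base_acc > min_gain and acc > best_acc:
                best_model, best_acc = model, acc
        selection[cls] = best_model
    return selection


baseline = {"bird": 0.60, "car": 0.80}
validated = {
    "rotation": {"bird": 0.70, "car": 0.75},
    "relocation": {"bird": 0.65, "car": 0.78},
}
chosen = select_transform_models(baseline, validated)
```

In this made-up example, the rotation model is chosen for the bird class, while the car class keeps the original, un-augmented samples because neither model improves on its baseline.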

An application operation 1106 applies the selected transform model(s) from the selection operation 1104 to the original training samples to generate synthesized training samples for each object class. The synthesized training samples include training samples for each object class that may be used to construct object recognition model(s) that have a satisfactorily increased accuracy in performing object recognition operations for each class. By selecting the variations to apply to the original training samples for each object class, the accuracy of the object recognition operations is significantly increased without human input. An output operation 1108 outputs one or more object recognition models for each object class constructed from the training samples containing the original training samples and the synthesized training samples generated in the application operation 1106.

FIG. 12 illustrates example operations 1200 for performing image processing on an unknown image based on training samples. A receiving operation 1202 receives training samples containing original training samples and synthesized training samples into an image processing engine. The original training samples represent a set of training images and annotations for one or more object classes. The training images contained in the original training samples are annotated by using a manual labeling method (e.g., bounding boxes) to specify which pixels in a training image belong to an instance of a specific object type (e.g., birds and trees). All pixels not belonging to a specific object in a training image are annotated as background. The synthesized training samples include training samples for each object class that may be used to construct object recognition model(s) that have a satisfactorily increased accuracy in performing object recognition operations for each class. The original training samples and synthesized training samples are used to train the image processing engine to perform various object recognition operations. Based on the training, the image processing engine constructs one or more object recognition models, which have satisfactorily increased performance accuracy for each object class.

A receiving operation 1204 receives an unknown image into the image processing engine. The unknown image may have one or more objects belonging to various object classes. The image processing engine implements the object recognition models constructed from the original training samples and the synthesized training samples to perform object recognition operations on the unknown image.

A performing operation 1206 performs image processing on the unknown image using one or more image processing modules. The performing operation 1206 executes the object recognition models to perform the object recognition operations, such as detection, segmentation, and classification, for various object types. Object detection locates an object in a given image and may identify the object by placing a bounding box around the object. Object segmentation precisely identifies an object in a given image by selecting which pixels in the image belong to the object. Object classification determines whether a given image contains an object belonging to a particular class.

An output operation 1208 outputs object processing results, which include one or more of a detected object, a segmented object, and/or a classified object. The output operation 1208 may present the object processing results in a graphical user interface, data readout, data stream, etc.

FIG. 13 illustrates an example system that may be useful in implementing the described technology. The example hardware and operating environment of FIG. 13 for implementing the described technology includes a computing device, such as a general purpose computing device in the form of a gaming console or computer 20, a mobile telephone, a personal data assistant (PDA), a set top box, or other type of computing device. In the implementation of FIG. 13, for example, the computer 20 includes a processing unit 21, a system memory 22, and a system bus 23 that operatively couples various system components including the system memory to the processing unit 21. There may be only one or there may be more than one processing unit 21, such that the processor of computer 20 comprises a single central-processing unit (CPU), or a plurality of processing units, commonly referred to as a parallel processing environment. The computer 20 may be a conventional computer, a distributed computer, or any other type of computer; the invention is not so limited.

The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, a switched fabric, point-to-point connections, and a local bus using any of a variety of bus architectures. The system memory may also be referred to as simply the memory, and includes read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system (BIOS) 26, containing the basic routines that help to transfer information between elements within the computer 20, such as during start-up, is stored in ROM 24. The computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM, a DVD, or other optical media.

The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical disk drive interface 34, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the computer 20. It should be appreciated by those skilled in the art that any type of computer-readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, random access memories (RAMs), read only memories (ROMs), and the like, may be used in the example operating environment.

A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24, or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port, or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor, computers typically include other peripheral output devices (not shown), such as speakers and printers.

The computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as remote computer 49. These logical connections are achieved by a communication device coupled to or a part of the computer 20; the invention is not limited to a particular type of communications device. The remote computer 49 may be another computer, a server, a router, a network PC, a client, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 20, although only a memory storage device 60 has been illustrated in FIG. 13. The logical connections depicted in FIG. 13 include a local-area network (LAN) 51 and a wide-area network (WAN) 52. Such networking environments are commonplace in office networks, enterprise-wide computer networks, intranets and the Internet, which are all types of networks.

When used in a LAN-networking environment, the computer 20 is connected to the local network 51 through a network interface or adapter 53, which is one type of communications device. When used in a WAN-networking environment, the computer 20 typically includes a modem 54, a network adapter, a type of communications device, or any other type of communications device for establishing communications over the wide area network 52. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage device. It is appreciated that the network connections shown are examples and that other means of and communications devices for establishing a communications link between the computers may be used.

In an example implementation, a synthesizer module, image processor engine, model selector module, and other modules and services may be embodied by instructions stored in memory 22 and/or storage devices 29 or 31 and processed by the processing unit 21. Training sample sets, unknown images, synthesized image generating models, and other data may be stored in memory 22 and/or storage devices 29 or 31 as persistent datastores.

Some embodiments may comprise an article of manufacture. An article of manufacture may comprise a storage medium to store logic. Examples of a storage medium may include one or more types of computer-readable storage media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. Examples of the logic may include various software elements, such as software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. In one embodiment, for example, an article of manufacture may store executable computer program instructions that, when executed by a computer, cause the computer to perform methods and/or operations in accordance with the described embodiments. The executable computer program instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like. The executable computer program instructions may be implemented according to a predefined computer language, manner or syntax, for instructing a computer to perform a certain function. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.

The embodiments of the invention described herein are implemented as logical steps in one or more computer systems. The logical operations of the present invention are implemented (1) as a sequence of processor-implemented steps executing in one or more computer systems and (2) as interconnected machine or circuit modules within one or more computer systems. The implementation is a matter of choice, dependent on the performance requirements of the computer system implementing the invention. Accordingly, the logical operations making up the embodiments of the invention described herein are referred to variously as operations, steps, objects, or modules. Furthermore, it should be understood that logical operations may be performed in any order, unless explicitly claimed otherwise or a specific order is inherently necessitated by the claim language.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

The above specification, examples, and data provide a complete description of the structure and use of exemplary implementations of the described technology. Since many implementations can be made without departing from the spirit and scope of the described technology, the invention resides in the claims hereinafter appended. Furthermore, structural features of the different embodiments may be combined in yet another embodiment without departing from the recited claims.

1. A method comprising: generating synthesized training samples by applying one or more object-based transform models to original training samples; validating the one or more object-based transform models based on image processing accuracy for an object class, based on the synthesized training samples; and selecting one or more object-based transform models from the one or more validated object-based transform models, the selected object-based transform models having a satisfactorily increased image processing accuracy for the object class.

2. The method of claim 1 wherein the one or more object-based transform models includes an object-based transform model that relocates an object from a source image to a different image region.

3. The method of claim 2 wherein the different image region is within the source image.

4. The method of claim 2 wherein the different image region is within a background of a target image.

5. The method of claim 4 wherein the target image is selected based on a similarity of background value between the source image and the target image.

6. The method of claim 4 wherein the target image is selected based on a presence of one or more co-occurring objects.

7. The method of claim 1 wherein the object-based transform model scales an object.

8. The method of claim 1 wherein the object-based transform model rotates an object.

9. One or more computer-readable storage media encoding computer-executable instructions for executing on a computer system a computer process, the computer process comprising: generating synthesized training samples by applying one or more object-based transform models to original training samples.

10. The one or more computer-readable storage media of claim 9 wherein the one or more object-based transform models include an object-based transform model that relocates an object from a source image to a different image region.

11. The one or more computer-readable storage media of claim 10 wherein the different image region is within the source image.

12. The one or more computer-readable storage media of claim 10 wherein the different image region is within a background of a target image.

13. The one or more computer-readable storage media of claim 12 wherein the target image is selected based on a similarity of background value between the source image and the target image.

14. The one or more computer-readable storage media of claim 12 wherein the target image is selected based on a presence of one or more co-occurring objects.

15. A system comprising: a synthesizer module configured to generate synthesized training samples by applying one or more object-based transform models to original training samples.

16. The system of claim 15 wherein the one or more object-based transform models includes an object-based transform model that relocates an object from a source image to a different image region.

17. The system of claim 16 wherein the different image region is within the source image.

18. The system of claim 16 wherein the different image region is within a background of a target image.

19. The system of claim 18 wherein the target image is selected based on a similarity of background value between the source image and the target image.

20. The system of claim 18 wherein the target image is selected based on a presence of one or more co-occurring objects.